Unless you’re already an expert on the slendr toolkit, you will probably need some reference materials to do the exercises. Here are the options:
You can refer to the slides with the slendr crash course. You can find the material rendered as normal slides or one-page handouts (the latter being probably a bit more practical for reference.
Relevant tutorials on the slendr website which you can find under the “Articles” link in the header.
Manual pages of all available slendr functions. Note that you can get the help page of every slendr R function by typing ?function in the R console. For instance, typing ?ts_tajima gives you the help page of the tskit/slendr function implementing the tree-based computation of Tajima’s D.
Installation setup
The easiest way to set up everything on your computer is to do the following:
Clone the repository with the activity materials (source code with slides and exercises materials, example scripts, and solutions). In a shell terminal on Linux or macOS, in your home directory (or anywhere else, really) you can run:
Install all the R package dependencies by going into the activity repository directory you just cloned and installing the necessary R packages by following these steps:
First go into the project directory you just cloned:
$ cd ~/slendr_activity
Open the R terminal in that directory. You should get a note that the renv package is being automatically setup, like this:
$ R
[... R startup information stuff ...]
# Bootstrapping renv 1.0.11 --------------------------------------------------
- Downloading renv ... OK
- Installing renv ... OK
- Project '~/slendr_activity' loaded. [renv 1.0.11]
- One or more packages recorded in the lockfile are not installed.
- Use `renv::status()` for more details.
Install the R package dependencies (still in the R console!):
> renv::restore(prompt = FALSE)
Set up the Python environment used by the slendr R package for simulation and tree-sequence analysis (still in the R console!):
> slendr::setup_env(agree = TRUE)
If everything worked, you should get an optimistic message saying:
======================================================================
Python environment for slendr has been successfuly created, and the R
interface to msprime, tskit, and pyslim modules has been activated.
In future sessions, activate this environment by calling init_env().
=======================================================================
Open RStudio and navigate to your project directory via File -> Open Project....
If the setup_env() installation procedure fails, try the following:
Delete the failed installation attempt:
slendr::clear_env(force = TRUE, all = TRUE)
Try installing it again, this time using pip as a Python installation method (the default is conda which unfortunately fails fairly often):
slendr::setup_env(agree = TRUE, pip = TRUE)
In every previous installments of this workshop, this is all that was needed to resolve problems.
Installing SLiM
It’s unclear whether we will manage to go through the entirety of the final exercise. However, to be able to do this, having SLiM at least version 4.2 (and it being available in your unix $PATH!) is required. If this isn’t possible for your, don’t worry. You’ll be able to do most of that exercise even without SLiM, and I will demonstrate the whole exercise (including the SLiM bit) for you.
Organization of the exercises
For each exercise, you will get a brief explanation of the problem at hand and functions that could be useful to solve the exercise. The concrete, specific task will be always written like this in bold.
Your goal for each exercise is to write a complete R script script. I suggest you name the script for each exercise as exercise1.R, exercise2.R, etc., just to keep things tidy.
Unless you have a strong preference for another editor or IDE, I recommend you use RStudio (either on your machine or in the cloud, as provided by the workshop organizers, depending on where you did the setup steps above).
Each exercise is compose of individual parts, which are designed to build one upon the other in the order they are specified.
Note on the programming aspect of the exercises
All the exercises will involve “real coding”! If you’ve never really programmed entire scripts before, this could feel a little intimidating. Don’t worry. If you’re ever lost, just take a peek at the solution which is (by default hidden) under each exercise section. Always try to work on a solution on your own, but never let this be a barrier to your progress. Feel free to copy-paste bits of my solutions into your own scripts.
If you find yourself totally lost, don’t hesitate to read my solutions from the get go, copy-pasting them into your own script in the RStudio, executing them line by line, and trying to understand what’s going on.
Exercise 1: Programming demographic models with slendr
Part 1: Building a demographic model in slendr
Use functions such as population(), gene_flow(), and compile_model() to program the following toy model of human demographic history in slendr.
Hint: Start script exercise1.R script in your RStudio session using this “template”:
Note: With slendr you can specify time in whatever format is more convenient or readable for your model. For instance here, because we’re dealing with historical events which are commonly expressed in times given as”years ago”, we can write them in a decreasing order – i.e. 7Mya → 6Mya → …, as shown above – or, in terms of R code, 7e6 (or 7000000), 6e6 (6000000), etc..
Click to see the solution
library(slendr)init_env()## The interface to all required Python modules has been activated.# Chimpanzee outgroupchimp <-population("CHIMP", time =7e6, N =5000)# Two populations of anatomically modern humans: Africans and Europeansafr <-population("AFR", parent = chimp, time =6e6, N =15000)eur <-population("EUR", parent = afr, time =70e3, N =3000)# Neanderthal population splitting at 600 ky ago from modern humans# (becomes extinct by 40 ky ago)nea <-population("NEA", parent = afr, time =600e3, N =1000, remove =40e3)# Neanderthal introgression event (3% admixture between 55-50 kya)gf <-gene_flow(from = nea, to = eur, rate =0.03, start =55000, end =50000)# Compile the entire model into a single slendr R objectmodel <-compile_model(populations =list(chimp, nea, afr, eur),gene_flow = gf,generation_time =30,path = here::here("data/introgression"), # <--- don't worry about these twooverwrite =TRUE, force =TRUE# lines of code (ask me if interested))
Part 2: Inspecting the model visually
To visualize a slendr model, you can use the function plot_model(). Plot your compiled model to make sure you programmed it correctly!
Note: Plotting of models can be sometimes a little wonky. When plotting your model, experiment with arguments log = TRUE, proportions = TRUE, gene_flow = TRUE. Check ?plot_model for more information on these.
Once you have a compiled slendr model stored in an R variable (from now on, model will always mean a variable containing a compiled slendr model object, for simplicity), we can simulate data from it. By default, slendr models always produce a tree sequence.
There are two simulation engines built into slendr implemented by functions msprime() and slim(). For traditional, non-spatial, neutral demographic models, the engine provided by the msprime() function is much more efficient, so we’ll be using that.
Here’s how you can use it to simulate a tree sequence:
ts <-msprime(<model object>, sequence_length =<length of sequence to simulate [as bp]>, recombination_rate =<uniform recombination rate [per bp per generation]>)
The function has also arguments debug and run.
Simulate a tree sequence from your compiled model, storing it to a variable ts as shown right above. Then experiment with setting debug = TRUE (this prints out msprime’s own debugging summary) and then run = FALSE (this prints out a raw command-line which can run a slendr simulation in the shell).
Click to see the solution
# This simulates a tskit tree sequence from a slendr model. Note that you didn't have# to write any msprime or tskit Python code!ts <-msprime(model, sequence_length =1e6, recombination_rate =1e-8)# Setting `debug = TRUE` instructs slendr's built-in msprime script to print# out msprime's own debugger information. This can be very useful for debugging,# in addition to the visualization of the model as shown above.ts <-msprime(model, sequence_length =1e6, recombination_rate =1e-8, debug =TRUE)## DemographyDebugger## ╠═════════════════════════════════════╗## ║ Epoch[0]: [0, 1.67e+03) generations ║## ╠═════════════════════════════════════╝## ╟ Populations (total=4 active=4)## ║ ┌─────────────────────────────────────────────────────────────────────┐## ║ │ │ start│ end│growth_rate │ CHIMP │ AFR │ NEA │ EUR │## ║ ├─────────────────────────────────────────────────────────────────────┤## ║ │ CHIMP│ 5000.0│ 5000.0│ 0 │ 0 │ 0 │ 0 │ 0 │## ║ │ AFR│ 15000.0│ 15000.0│ 0 │ 0 │ 0 │ 0 │ 0 │## ║ │ NEA│ 1000.0│ 1000.0│ 0 │ 0 │ 0 │ 0 │ 0 │## ║ │ EUR│ 3000.0│ 3000.0│ 0 │ 0 │ 0 │ 0 │ 0 │## ║ └─────────────────────────────────────────────────────────────────────┘## ╟ Events @ generation 1.67e+03## ║ ┌─────────────────────────────────────────────────────────────────────────────────────────┐## ║ │ time│type │parameters │effect │## ║ ├─────────────────────────────────────────────────────────────────────────────────────────┤## ║ │ 1666│Migration rate │source=EUR, dest=NEA, │Backwards-time migration rate from EUR │## ║ │ │change │rate=0.000179640718562 │to NEA → 0.00017964071856287425 │## ║ │ │ │87425 │ │## ║ └─────────────────────────────────────────────────────────────────────────────────────────┘## ╠════════════════════════════════════════════╗## ║ Epoch[1]: [1.67e+03, 1.83e+03) generations ║## ╠════════════════════════════════════════════╝## ╟ Populations (total=4 active=4)## ║ ┌───────────────────────────────────────────────────────────────────────────┐## ║ │ │ start│ end│growth_rate │ CHIMP │ AFR │ NEA │ EUR │## ║ ├───────────────────────────────────────────────────────────────────────────┤## ║ │ CHIMP│ 5000.0│ 5000.0│ 0 │ 0 │ 0 │ 0 │ 0 │## ║ │ AFR│ 15000.0│ 15000.0│ 0 │ 0 │ 0 │ 0 │ 0 │## ║ │ NEA│ 1000.0│ 1000.0│ 0 │ 0 │ 0 │ 0 │ 0 │## ║ │ EUR│ 3000.0│ 3000.0│ 0 │ 0 │ 0 │ 0.0001796 │ 0 │## ║ └───────────────────────────────────────────────────────────────────────────┘## ╟ Events @ generation 1.83e+03## ║ ┌────────────────────────────────────────────────────────────────────────────────────────┐## ║ │ time│type │parameters │effect │## ║ ├────────────────────────────────────────────────────────────────────────────────────────┤## ║ │ 1833│Migration rate │source=EUR, dest=NEA, │Backwards-time migration rate from EUR │## ║ │ │change │rate=0 │to NEA → 0 │## ║ │┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈│## ║ │ 1833│Census │ │Insert census nodes to record the │## ║ │ │ │ │location of all lineages │## ║ └────────────────────────────────────────────────────────────────────────────────────────┘## ╠════════════════════════════════════════════╗## ║ Epoch[2]: [1.83e+03, 2.33e+03) generations ║## ╠════════════════════════════════════════════╝## ╟ Populations (total=4 active=4)## ║ ┌─────────────────────────────────────────────────────────────────────┐## ║ │ │ start│ end│growth_rate │ CHIMP │ AFR │ NEA │ EUR │## ║ ├─────────────────────────────────────────────────────────────────────┤## ║ │ CHIMP│ 5000.0│ 5000.0│ 0 │ 0 │ 0 │ 0 │ 0 │## ║ │ AFR│ 15000.0│ 15000.0│ 0 │ 0 │ 0 │ 0 │ 0 │## ║ │ NEA│ 1000.0│ 1000.0│ 0 │ 0 │ 0 │ 0 │ 0 │## ║ │ EUR│ 3000.0│ 3000.0│ 0 │ 0 │ 0 │ 0 │ 0 │## ║ └─────────────────────────────────────────────────────────────────────┘## ╟ Events @ generation 2.33e+03## ║ ┌───────────────────────────────────────────────────────────────────────────┐## ║ │ time│type │parameters │effect │## ║ ├───────────────────────────────────────────────────────────────────────────┤## ║ │ 2333│Population │derived=[EUR], │Moves all lineages from the 'EUR' │## ║ │ │Split │ancestral=AFR │derived population to the ancestral │## ║ │ │ │ │'AFR' population. Also set 'EUR' to │## ║ │ │ │ │inactive, and all migration rates to │## ║ │ │ │ │and from the derived population to │## ║ │ │ │ │zero. │## ║ └───────────────────────────────────────────────────────────────────────────┘## ╠═════════════════════════════════════════╗## ║ Epoch[3]: [2.33e+03, 2e+04) generations ║## ╠═════════════════════════════════════════╝## ╟ Populations (total=4 active=3)## ║ ┌───────────────────────────────────────────────────────────────┐## ║ │ │ start│ end│growth_rate │ CHIMP │ AFR │ NEA │## ║ ├───────────────────────────────────────────────────────────────┤## ║ │ CHIMP│ 5000.0│ 5000.0│ 0 │ 0 │ 0 │ 0 │## ║ │ AFR│ 15000.0│ 15000.0│ 0 │ 0 │ 0 │ 0 │## ║ │ NEA│ 1000.0│ 1000.0│ 0 │ 0 │ 0 │ 0 │## ║ └───────────────────────────────────────────────────────────────┘## ╟ Events @ generation 2e+04## ║ ┌────────────────────────────────────────────────────────────────────────────┐## ║ │ time│type │parameters │effect │## ║ ├────────────────────────────────────────────────────────────────────────────┤## ║ │ 2e+04│Population │derived=[NEA], │Moves all lineages from the 'NEA' │## ║ │ │Split │ancestral=AFR │derived population to the ancestral │## ║ │ │ │ │'AFR' population. Also set 'NEA' to │## ║ │ │ │ │inactive, and all migration rates to │## ║ │ │ │ │and from the derived population to │## ║ │ │ │ │zero. │## ║ └────────────────────────────────────────────────────────────────────────────┘## ╠══════════════════════════════════════╗## ║ Epoch[4]: [2e+04, 2e+05) generations ║## ╠══════════════════════════════════════╝## ╟ Populations (total=4 active=2)## ║ ┌─────────────────────────────────────────────────────────┐## ║ │ │ start│ end│growth_rate │ CHIMP │ AFR │## ║ ├─────────────────────────────────────────────────────────┤## ║ │ CHIMP│ 5000.0│ 5000.0│ 0 │ 0 │ 0 │## ║ │ AFR│ 15000.0│ 15000.0│ 0 │ 0 │ 0 │## ║ └─────────────────────────────────────────────────────────┘## ╟ Events @ generation 2e+05## ║ ┌──────────────────────────────────────────────────────────────────────────────┐## ║ │ time│type │parameters │effect │## ║ ├──────────────────────────────────────────────────────────────────────────────┤## ║ │ 2e+05│Population │derived=[AFR], │Moves all lineages from the 'AFR' │## ║ │ │Split │ancestral=CHIMP │derived population to the ancestral │## ║ │ │ │ │'CHIMP' population. Also set 'AFR' to │## ║ │ │ │ │inactive, and all migration rates to │## ║ │ │ │ │and from the derived population to │## ║ │ │ │ │zero. │## ║ └──────────────────────────────────────────────────────────────────────────────┘## ╠════════════════════════════════════╗## ║ Epoch[5]: [2e+05, inf) generations ║## ╠════════════════════════════════════╝## ╟ Populations (total=4 active=1)## ║ ┌─────────────────────────────────────────┐## ║ │ │ start│ end│growth_rate │## ║ ├─────────────────────────────────────────┤## ║ │ CHIMP│ 5000.0│ 5000.0│ 0 │## ║ └─────────────────────────────────────────┘# For debugging of technical issues (with msprime, with slendr, or both), it is# very useful to have the `msprime` function dump the "raw" command-line to# run the simulation on the terminal using plain Python interpretermsprime(model, sequence_length =1e6, recombination_rate =1e-8, run =FALSE)## /Users/mp/Library/r-miniconda-arm64/envs/Python-3.12_msprime-1.3.3_tskit-0.5.8_pyslim-1.0.4_tspop-0.0.2/bin/python /Users/mp/Projects/cesky-krumlov-2025/data/introgression/script.py --seed 897768123 --model /Users/mp/Projects/cesky-krumlov-2025/data/introgression --sequence-length 1000000 --recombination-rate 1e-08 --path <path to a .trees file>
Part 4: Inspecting the tree-sequence object
As we’ll see later, slendr tries to provide an R-friendly interface to accessing a subset of tskit’s functionality for working with tree sequence and for computing various popgen statistics.
For now, type out the ts object in the terminal – what do you see? You should get a summary of a tree-sequence object that you’re familiar with from your msprime and tskit activity.
Click to see the solution
# Typing out the object with the result shows that it's a good old tskit# tree-sequence objectts
Use the ts_table function to inspect the low-level table-based representation of a tree sequence. For instance, you can get the table of nodes with ts_table(ts, "nodes"), edges with ts_table(ts, "edges"), and do the same thing for “individuals”, “mutations”, and “sites”. Does your tree sequence contain any mutations? If not, why?
Click to see the solution
# slendr provides a helper function which allows access to all the low-level# components of every tree-sequence objectts_table(ts, "nodes")
# We didn't simulate any mutations, so we only have genealogies for now.ts_table(ts, "mutations")
# A tibble: 0 × 5
# ℹ 5 variables: id <dbl>, site <int>, node <int>, time <dbl>, time_tskit <dbl>
ts_table(ts, "sites")
# A tibble: 0 × 2
# ℹ 2 variables: id <dbl>, position <dbl>
There are also two slendr-specific functions called ts_samples() and ts_nodes(). You can run them simply as ts_samples(ts) and ts_nodes(ts). What do the tables they produce represent?
Click to see the solution
# slendr provides a convenient function `ts_samples()` which allows us to# inspect the contents of a simulated tree sequence in a more human-readable,# simplified way. We can see that our tree sequence contains a massive number# of individuals. Too many, in fact (we recorded every single individual alive# at the end of our simulation -- something we're unlikely to be ever lucky# enough to have, regardless of which species we study)ts_samples(ts) %>%nrow()
[1] 24000
# This function returns a table similar to the one produced by `ts_table(ts, "nodes")`# above, except that it contains additional slendr metadata (names of individuals# belonging to each node, spatial coordinates of nodes for spatial models, etc.).# It's a bit more useful for analyzing tree-sequence data than the "low-level" functions.ts_nodes(ts)
In the table produced by the ts_samples() function, you can see that the tree sequence we simulated record everyone. It’s very unlikely, unless we’re extremely lucky, that we’ll ever have a sequence of every single individual in a population that we study. To get a little closer to the scale of the genomic data that we’ll be working with, we can restrict our simulation to only record a subset of individuals.
We can precisely define which individuals (from which populations, and at which times) should be recorded in a tree sequence using the slendr function schedule_sampling(). For instance, if we have a model with some slendr populations in variables popXpopY, we can schedule the recording of 5 individuals from each at times 10000 (years ago) and 0 (present-day) by the following code:
pop_schedule <-schedule_sampling(model, times =c(10000, 0), list(popX, 5), list(popY), 5)
This function simply returns a data frame. As such, we can create multiple of such schedules (of arbitrary complexity and granularity), and then bind them together with a single line of code, like this:
# Note that the `times =` argument can be a vector of times...ancient_times <-c(40000, 30000, 20000, 10000)eur_samples <-schedule_sampling(model, times = ancient_times, list(eur, 1))# ... but also just a single numberafr_samples <-schedule_sampling(model, times =0, list(afr, 1), list(eur, 42))nea_samples <-schedule_sampling(model, time =60000, list(nea, 1))# And that multiple such sampling schedules can be bound together like this:samples <-rbind(eur_samples, afr_samples, nea_samples)
Using the function schedule_sampling (and with the help of rbind), program the sampling of the following sample sets at given times, saving it to variable called schedule:
time
population(s)
# individuals
70000
nea
1
40000
nea
1
0
chimp, afr, eur
5
Additionally, schedule the sampling of one eur individual at these times:
times <-seq(40000, 2000, by =-2000)
In total, you should schedule the recording of 38 individuals.
Click to see the solution
# Here we scheduled the sampling of two Neanderthals at 70kya and 40kyanea_samples <-schedule_sampling(model, times =c(70000, 40000), list(nea, 1))nea_samples # (this function produces a plain old data frame!)
# A tibble: 2 × 7
time pop n y_orig x_orig y x
<int> <chr> <int> <lgl> <lgl> <lgl> <lgl>
1 40000 NEA 1 NA NA NA NA
2 70000 NEA 1 NA NA NA NA
# Here we schedule one Chimpanzee sample, 5 African samples, and 10 European samplespresent_samples <-schedule_sampling(model, times =0, list(chimp, 1), list(afr, 5), list(eur, 10))# We also schedule the recording of one European sample between 50kya and 2kya,# every 2000 yearstimes <-seq(40000, 2000, by =-2000)emh_samples <-schedule_sampling(model, times, list(eur, 1))# Because those functions produce nothing but a data frame, we can bind# individual sampling schedules togetherschedule <-rbind(nea_samples, present_samples, emh_samples)schedule
# A tibble: 25 × 7
time pop n y_orig x_orig y x
<int> <chr> <int> <lgl> <lgl> <lgl> <lgl>
1 40000 NEA 1 NA NA NA NA
2 70000 NEA 1 NA NA NA NA
3 0 CHIMP 1 NA NA NA NA
4 0 AFR 5 NA NA NA NA
5 0 EUR 10 NA NA NA NA
6 2000 EUR 1 NA NA NA NA
7 4000 EUR 1 NA NA NA NA
8 6000 EUR 1 NA NA NA NA
9 8000 EUR 1 NA NA NA NA
10 10000 EUR 1 NA NA NA NA
# ℹ 15 more rows
Then, verify the correctness of your overall sampling schedule by visualizing it together with your model like this:
Note: As you’ve seen above, the visualization is often a bit wonky and convoluted with overlapping elements and it can be even worse with samples added, but try to experiment with arguments to plot_model described above to make the plot a bit more helpful for sanity checking.
Use your combined sampling schedule stored in the schedule variable to run a new tree-sequence simulation from your model (again using the msprime() function), this time restricted to just those individuals scheduled for recording. You can do this by providing the combined sampling schedule as the samples = argument of the function msprime you used above.
Click to see the solution
# The command below will likely take a few minutes to run, so feel free to go# down from 100 Mb sequence_length to even 10Mb (it doesn't matter much)# tstart <- Sys.time()ts <-msprime(model, sequence_length =100e6, recombination_rate =1e-8, samples = schedule, random_seed =1269258439) %>%ts_mutate(mutation_rate =1e-8, random_seed =1269258439)# tend <- Sys.time()# tend - tstart # Time difference of 2.141642 mins# If you're bothered by ho long this takes, feel free to call these two lines# to 100% reproduce my results without any expensive computation:model <-read_model(here::here("data/introgression"))ts <-ts_read(here::here(file ="data/introgression.trees"), model = model)# We can save a tree sequence object using a slendr function `ts_write` (this# can be useful if we want to save the results of a simulation for later use).dir.create("data", showWarnings =FALSE)ts_write(ts, "data/introgression.trees")
Inspect the tree-sequence object saved in the ts variable again, taking note of the (now much smaller!) number of individuals. You can also do a similar thing via the table produced by the ts_samples() function.
Click to see the solution
# Inspect the (tskit/Python-based) summary of the new tree sequencets
# Compute the count of individuals in different time pointssuppressPackageStartupMessages(library(dplyr))ts_samples(ts) %>%group_by(pop, time ==0) %>% tally %>%select(pop, n)
# A tibble: 5 × 2
# Groups: pop [4]
pop n
<chr> <int>
1 AFR 5
2 CHIMP 1
3 EUR 20
4 EUR 10
5 NEA 2
Bonus exercises
Bonus 1: Program your own model
Do you have a favourite model organism or you study another species? Take some of the most important features of its demographic history and program them as a new slendr model.*
Exercise 2: Computing popgen statistics on tree sequences from slendr
In this exercise, you will build on top of the results from Exercise 1. Specifically, we will learn how to compute popgen statistics on slendr-simulated tree sequences using slendr’s interface to the tskit Python module.
First, create a new R script exercise2.R and paste in the following code. This is one of the possible solutions to the Exercise 1, and it’s easier if we all use it to be on the same page from now on, starting from the same model and the same simulated tree sequence:
library(slendr)init_env()## The interface to all required Python modules has been activated.chimp <-population("CHIMP", time =7e6, N =5000)afr <-population("AFR", parent = chimp, time =6e6, N =15000)eur <-population("EUR", parent = afr, time =70e3, N =3000)nea <-population("NEA", parent = afr, time =600e3, N =1000, remove =40e3)gf <-gene_flow(from = nea, to = eur, rate =0.03, start =55000, end =50000)model <-compile_model(populations =list(chimp, nea, afr, eur),gene_flow = gf,generation_time =30)ts <-ts_read(here::here("data/introgression.trees"), model = model)cowplot::plot_grid(plot_model(model, proportions =TRUE),plot_model(model, proportions =TRUE, log =TRUE),nrow =1)
As a sanity check, let’s make sure the tree sequence does contain a set of sample which matches our intended sampling schedule (particularly the time series of European individuals and the two Neanderthals):
ts_samples(ts) %>%group_by(pop, time ==0) %>% tally %>%select(pop, n)
# A tibble: 5 × 2
# Groups: pop [4]
pop n
<chr> <int>
1 AFR 5
2 CHIMP 1
3 EUR 20
4 EUR 10
5 NEA 2
Everything looks good! Having made sure that the ts object contains the individuals we want, let’s move to the exercise.
Part 1: Computing nucleotide diversity
The model plotted above makes a fairly clear prediction on what would be the nucleotide diversity expected in the simulated populations. Compute the nucleotide diversity in all populations using the slendr function ts_diversity(). Do you get results you would expect from the model?
Hint: Nearly every slendr statistic function interfacing with tskit accepts a ts tree-sequence object as its first argument, and with further arguments being either a vector of individual names representing a group of samples to compute a statistic on, or a (named) list of such vectors (each element of that list for a group of samples) – these lists are intended to be equivalent to the sample_sets = argument of many tskit Python methods, except that they allow symbolic names of individuals, rather then integer indices of nodes in a tree sequence.
Although you can get all the above information by processing the table produced by the ts_samples() function, slendr provides a useful helper ts_names() which only returns the names of individuals.
When you call it directly, you get a plain vector of individual names:
This is not super helpful, unless we want to compute some statistic for everyone in the tree sequence, regardless of their population assignment. Perhaps a bit more useful is to call the function like this, because it will produce a result which can be immediately used as the sample_sets = argument mentioned in the Hint above:
As you can see, this gave us a normal R list, with each element containing a vector of individual names in a population. Note that we can use standard R list indexing to get subsets of individuals:
Many of the following exercises will use these kinds of tricks to instruct various slendr / tskit functions to compute statistics on subsets of all individuals sub-sampled in this way.
After you computed the nucleotide diversity per-population, compute it for each individual separately using the same function ts_diversity(), which gives you the heterozygosity for each individual.
Hint: You can do this by giving a vector of names as sample_sets = (so not an R list of vectors). Feel free to take a look at the solution if you find this distinction confusing.
Click to see the solution
Population-based nucleotide diversity:
# Let's first get a named list of individuals in each group we want to be# working with (slendr tree-sequence statistic functions generally operate# with this kind of structure)sample_sets <-ts_names(ts, split ="pop")sample_sets
# We can use such `sample_sets` object to compute nucleotide diversity (pi)\# in each population, in a bit of a similar manner to how we would do it# with the standard tskit in Pythonpi_pop <-ts_diversity(ts, sample_sets = sample_sets)arrange(pi_pop, diversity)
We can do this by passing the vector of individual names directory as the sample_sets = argument, rather than in a list of groups as we did above.
# For convenience, we first get a table of all individuals (which of course# contains also their names) and in the next step, we'll just add their# heterozygosities as a new column.pi_df <-ts_samples(ts)pi_df$name
# Let's plot the results using the ggplot2 package# (because a picture is worth a thousand numbers!)library(ggplot2)ggplot(pi_df, aes(pop, diversity, color = pop, group = pop)) +geom_boxplot(outlier.shape =NA) +geom_jitter() +theme_bw()
Part 2: Computing pairwise divergence
Use the function ts_divergence() to compute genetic divergence between all pairs of populations. Do you get results compatible with our demographic model?
Hint: Again, you can use the same concept of sample_sets = we discussed in the previous part.
Part 3: Detecting Neanderthal admixture in Europeans
Let’s now pretend its about 2008 and we’ve sequenced the first Neanderthal genome (and also have the genomes of a couple of people from Africa and Europe). We want to answer the most burning question of all evolutionary anthropology: “Do some people living today carry Neanderthal ancestry?”
Earlier you’ve learned about \(f\)-statistics of various kinds. You have also heard that an \(f_4\) statistic (or its equivalent \(D\) statistic) can be used as a test of “treeness”. Simply speaking, for some “quartet” of individuals or population samples, they can be used as a hypothesis test of whether the history of those samples is compatible with there not having been an introgression.
Compute the\(f_4\) test of Neanderthal introgression in EUR individuals using the slendr function ts_f4(). When you’re running it, you will have to provide individuals to compute the statistic on using a slightly different format. Take a look at the help page available as ?ts_f4 for more information. When you’re computing the values, make sure to also set mode = "branch" (we will get to why a bit later).
Hint: If you haven’t learned this in your \(f\)-statistics lecture, you want to compute (and compare) the values of these two statistics using the slendr function ts_f4():
\(f_4\)(<some African>, <a test European>; <Neanderthal>, <Chimp>)
Roughly speaking, we can understand this computation in terms of comparing the counts of so-called BABA and ABBA allele patterns between the quarted of samples specified in the statistics:
The first is not expected to give values “too different” from 0 because we don’t expect two African individuals to differ “significantly” in terms of how much alleles they share with a Neanderthal. The other should – if there was a Neanderthal introgression into Europeans some time in their history – be “significantly negative”.
Is the second of those two statistics “much more negative” than the first, as expected assuming introgression from Neanderthals into Europeans?
Why am I putting “significantly” and “much more negative” in quotes? What are we missing here for this being a true hypothesis test as you might be accustomed to from computing\(f\)-statistics using a tool such as ADMIXTOOLS? (We will get to this again in the following part of this exercise.)
Click to see the solution
# Compute the difference in the amount of allele sharing between two African# individuals and a Neanderthalf4_null <-ts_f4(ts, W ="AFR_1", X ="AFR_2", Y ="NEA_1", Z ="CHIMP_1", mode ="branch")f4_null
# A tibble: 1 × 5
W X Y Z f4
<chr> <chr> <chr> <chr> <dbl>
1 AFR_1 AFR_2 NEA_1 CHIMP_1 43.4
# Compute the difference in the amount of allele sharing between an African# individual vs European individual and a Neanderthalf4_alt <-ts_f4(ts, W ="AFR_1", X ="EUR_1", Y ="NEA_1", Z ="CHIMP_1", mode ="branch")f4_alt
# A tibble: 1 × 5
W X Y Z f4
<chr> <chr> <chr> <chr> <dbl>
1 AFR_1 EUR_1 NEA_1 CHIMP_1 -853.
# We can see that the second test resulted in an f4 value about ~20 times more# negative than the first test, indicating that a European in our test carries# "significantly more" Neanderthal alleles compared to the baseline expectation# of no introgression established by the comparison to an African ...f4_alt$f4 / f4_null$f4
[1] -19.65719
# ... although this is not a real test of significance (we have no Z-score or# standard error which would give us something like a p-value for the hypothesis# test, as we get by jackknife procedure in ADMIXTOOLS)
Part 5: Detecting Neanderthal admixture in Europeans v2.0
The fact that we don’t get something equivalent to a p-value in these kinds of simulations is generally not a problem, because we’re often interested in establishing a trend of a statistic under various conditions, and understanding when and how it behaves in a certain way. If statistical noise is a problem, we can easily work around it by computing a statistic on multiple simulation replicates or even increasing the sample sizes. I used this approach quite successfully on a related problem in this paper.
On top of that, p-value of something like an \(f\)-statistic (whether it’s significantly different from zero) is also strongly affected by quality of the data, sequencing errors, coverage, etc. (which can certainly be done using simulations!). However, these are an orthogonal issue which has little to do with the underlying evolutionary model in question anyway – and that’s what we’re after in our exercises.
In short, how much is “significantly different from zero compared to some baseline expectation” can be easily drowned in noise in a simple single-individual comparisons as we did above. Let’s increase the sample size a bit to see if a statistical pattern becomes more apparent.
Compute the first\(f_4\) statistic of the baseline expectation between a pair of Africans and the second \(f_4\) statistic comparing an African and a European, but this time on all recorded Africans and all recorded Europeans, respectively. Plot the distributions of those two sets of statistics. This should remove lots of the uncertainty and make a statistical trend stand out much more clearly.
Hint: Whenever you need to compute something for many things in sequence, looping is very useful. One way to do compute, say, an \(f_4\) statistic over many individuals is use this kind of pattern using R’s looping function lapply():
# Loop over vector of individual names and apply a given ts_f4() expression on eachlist_f4 <-lapply(c("ind_1", "ind_2", ...),function(x) ts_f4(ts, W ="AFR_1", X = x, Y ="NEA_1", Z ="CHIMP_1", mode ="branch"))# The above gives us a list of data frames, so we need to bind them all into a# single table for easier interpretation and visualizationdf_f4 <-do.call(rbind, list_f4)
Click to see the solution
# Let's compute the f4 statistic for all Africans and Europeans to see the# f4 introgression patterns more clearlyf4_afr <-lapply(sample_sets$AFR, function(x) ts_f4(ts, W ="AFR_1", X = x, Y ="NEA_1", Z ="CHIMP_1", mode ="branch")) %>%do.call(rbind, .)f4_afr
# A tibble: 5 × 5
W X Y Z f4
<chr> <chr> <chr> <chr> <dbl>
1 AFR_1 AFR_1 NEA_1 CHIMP_1 0
2 AFR_1 AFR_2 NEA_1 CHIMP_1 43.4
3 AFR_1 AFR_3 NEA_1 CHIMP_1 -106.
4 AFR_1 AFR_4 NEA_1 CHIMP_1 27.7
5 AFR_1 AFR_5 NEA_1 CHIMP_1 -0.896
f4_eur <-lapply(sample_sets$EUR, function(x) ts_f4(ts, W ="AFR_1", X = x, Y ="NEA_1", Z ="CHIMP_1", mode ="branch")) %>%do.call(rbind, .)f4_eur
# Let's add population columns to each of the two results, and bind them together# for plottingf4_afr$pop <-"AFR"f4_eur$pop <-"EUR"# Bind both tables togetherf4_results <-rbind(f4_afr, f4_eur)# Visualize the resultsf4_results %>%ggplot(aes(pop, f4, color = pop)) +geom_boxplot() +geom_jitter() +geom_hline(yintercept =0, linetype =2) +ggtitle("f4(AFR, EUR; NEA, CHIMP)") +theme_bw()
# Why the difference between the "branch" and "site" modes?# See this tutorial (and particularly directly the linked section):# https://tskit.dev/tutorials/no_mutations.html#genealogy-based-measures-are-less-noisy
Bonus exercises
Bonus 1: Outgroup \(f_3\) statistic
Use the function ts_f3() to compute the outgroup\(f_3\) statistic between pairs of African-European, African-Neanderthal, and European-Neanderthal and a Chimpanzee outgroup.
Hint: The \(f_3\) statistic is traditionally expressed as \(f_3(A, B; C)\), where C represents the outgroup. Unfortunately, in tskit the outgroup is named A, with B and C being the pair of samples from which we trace the length of branches towards the outgroup, so the statistic is interpreted as \(f_3(B, C; A)\).
How do the outgroup f3 results compare to your expectation based on simple population relationships (and to the divergence computation above)?
Do you see any impact of introgression on the $f_3 value when a Neanderthal is included in the computation?
Click to see the solution
# f3(A, B; C) = E[ (A - C) * (B - C) ]# This means that in tskit, C is the outgroup (different from ADMIXTOOLS!)# We can compute f3 for individuals...ts_f3(ts, B ="AFR_1", C ="EUR_1", A ="CHIMP_1")
# A tibble: 1 × 4
A B C f3
<chr> <chr> <chr> <dbl>
1 CHIMP_1 AFR_1 EUR_1 0.00375
# ... but also whole populations (or population samples)ts_f3(ts, B = sample_sets["AFR"], C = sample_sets["EUR"], A ="CHIMP_1")
# A tibble: 1 × 4
A B C f3
<chr> <chr> <chr> <dbl>
1 CHIMP_1 AFR EUR 0.00375
ts_f3(ts, B = sample_sets["AFR"], C = sample_sets["NEA"], A ="CHIMP_1")
# A tibble: 1 × 4
A B C f3
<chr> <chr> <chr> <dbl>
1 CHIMP_1 AFR NEA 0.00358
ts_f3(ts, B = sample_sets["EUR"], C = sample_sets["NEA"], A ="CHIMP_1")
# A tibble: 1 × 4
A B C f3
<chr> <chr> <chr> <dbl>
1 CHIMP_1 EUR NEA 0.00359
Bonus 2: Outgroup \(f_3\) statistic as a linear combination of \(f_2\) statistics
You might have learned that any complex \(f\)-statistic can be expressed as a linear combination of multiple \(f_2\) statistics (which represent simple branch length separating two lineages). Verify that this is the case by looking up equation (20b) in this amazing paper and compute an\(f_3\) statistic for any arbitrary trio of individuals of your choosing using this linear combination of \(f_2\) statistics.
Click to see the solution
# standard f3ts_f3(ts, B ="AFR_1", C ="AFR_2", A ="CHIMP_1")
# A tibble: 1 × 4
A B C f3
<chr> <chr> <chr> <dbl>
1 CHIMP_1 AFR_1 AFR_2 0.00378
# a "homemade" f3 statistic as a linear combination of f2 statistics# f3(A, B; C) = f2(A, C) + f2(B, C) - f2(A, B) / 2homemade_f3 <- (ts_f2(ts, A ="AFR_1", B ="CHIMP_1")$f2 +ts_f2(ts, A ="AFR_2", B ="CHIMP_1")$f2 -ts_f2(ts, A ="AFR_1", B ="AFR_2")$f2) /2homemade_f3
[1] 0.003778796
Bonus 3: Trajectory of Neanderthal ancestry in Europe over time
There used to be a lot of controversy about the question of whether or not did Neanderthal ancestry proportion in Europeans decline or not over the past 40 thousand years (see figure 1 in this paper figure 2 in this paper).
Your simulated tree sequence contains a time-series of European individuals over time. Use the slendr function ts_f4ratio() to compute (and then plot) the proportion (commonly designated as alpha) of Neanderthal ancestry in Europe over time. Use \(f_4\)-ratio statistic of the following form:
ts_f4ratio(ts, X =<vector of ind. names>, A ="NEA_1", B ="NEA_2", C ="AFR_1", O ="CHIMP_1")
Click to see the solution
# Extract table with names and times of sampled Europeans (ancient and present day)eur_inds <-ts_samples(ts) %>%filter(pop =="EUR")eur_inds
# Compute f4-ration statistic (this will take ~30s) -- note that we can provide# a vector of names for the X sample set to the `ts_f4ratio()` functionnea_ancestry <-ts_f4ratio(ts, X = eur_inds$name, A ="NEA_1", B ="NEA_2", C ="AFR_1", O ="CHIMP_1")# Add the age of each sample to the table of proportionsnea_ancestry$time <- eur_inds$timenea_ancestry
nea_ancestry %>%ggplot(aes(time, alpha)) +geom_point() +geom_smooth(method ="lm", linetype =2, color ="red", linewidth =0.5) +xlim(40000, 0) +coord_cartesian(ylim =c(0, 0.1)) +labs(x ="time [years ago]", y ="Neanderthal ancestry proportion") +theme_bw() +ggtitle("Neanderthal ancestry proportion in Europeans over time")
`geom_smooth()` using formula = 'y ~ x'
# For good measure, let's test the significance of the decline using a linear modelsummary(lm(alpha ~ time, data = nea_ancestry))
Call:
lm(formula = alpha ~ time, data = nea_ancestry)
Residuals:
Min 1Q Median 3Q Max
-0.0102725 -0.0029417 0.0001505 0.0040630 0.0073792
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.205e-02 1.214e-03 18.173 <2e-16 ***
time -1.031e-07 6.204e-08 -1.662 0.108
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.004642 on 28 degrees of freedom
Multiple R-squared: 0.08979, Adjusted R-squared: 0.05728
F-statistic: 2.762 on 1 and 28 DF, p-value: 0.1077
Bonus 4 – how many unique f4 quartets are there?
In your lecture about \(f\)-statistics, you’ve probably learned about various symmetries in \(f_4\) (but also other \(f\)-statistics) depending on the arrangement of the “quartet”. As a trivial example, an \(f_3(A; B, C)\) and \(f_3(A; C, B)\) will give you exactly the same value, and the same thing applies even to more complex \(f\)-statistics like \(f_4\).
Use simulations to compute how manu unique\(f_4\) values involving a single quartet are there.
Hint: Draw some trees to figure out why could that be true. Also, when computing ts_f4(), set mode = "branch" to avoid the effect of statistical noise due to mutations.
Click to see the solution
# # install a combinatorics R package# install.packages("combinat")library(combinat)
Attaching package: 'combinat'
The following object is masked from 'package:utils':
combn
# These are the four samples we can create quartet combinations fromquartet <-c("AFR_1", "EUR_1", "NEA_1", "CHIMP_1")quartets <-permn(quartet)quartets
# How many permutations there are in total?# 4! = 4 * 3 * 2 * 1 = 24factorial(4)
[1] 24
# We should therefore have 24 different quartet combinations of sampleslength(quartets)
[1] 24
# Loop across all quartets, computing the corresponding f4 statistic (we want# to do this using branch lengths, not mutations, as the mutation-based computation# would involve statistical noise)all_f4s <-lapply(quartets, function(q) ts_f4(ts, q[1], q[2], q[3], q[4], mode ="branch"))# Bind the list of f4 results into a single data frame and inspect the resultsall_f4s <-bind_rows(all_f4s) %>%arrange(abs(f4))print(all_f4s, n =Inf)
# A tibble: 3 × 6
W X Y Z f4 `abs(f4)`
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 AFR_1 EUR_1 NEA_1 CHIMP_1 -853. 853.
2 AFR_1 NEA_1 CHIMP_1 EUR_1 -16428. 16428.
3 AFR_1 CHIMP_1 EUR_1 NEA_1 17281. 17281.
Exercise 3: Simulation-based inference of \(N_e\) from AFS
What we did so far is learning how slendr models provide an easy way to encode demographic models in R and simulate (even very large!) tree sequences from them. This allows us to very quickly verify our intuition about some popgen problem (things like “Hmmm, I wonder what would an \(f_4\) statistic look like if my model includes this particular gene-flow event?), in just a few lines of R. We have been able to even answer some questions like this directly in a meeting, pretty much on the spot! This makes slendr a very powerful “popgen calculator”.
Now let’s take things one step forward. Imagine you gathered some empirical data, like an allele frequency spectrum (AFS). That data was, in the real world, produced by some (hidden) biological process (demographic history) that we generally don’t know anything about a priori For instance, the population we study had some \(N_e\), which we don’t know – the only thing we have is the observed AFS.
Simulations can be a great tool to estimate the most likely value of an unknown parameter. Staying with this \(N_e\) and AFS example, can simulate a large number of AFS vectors (each resulting from a different assumed \(N_e\) value) and then pick just those \(N_e\) values (or just one \(N_e\) value) which produced a simulated AFS closest to the observed AFS. This is exactly what you’ll be doing just now in Exercise 3.
Part 1: A self-contained slendr function of \(N_e \rightarrow \textrm{AFS}\)
In a new script exercise3.R write a custom R function called simulate_afs(), which will take Ne as its only parameter. Use this function to compute AFS vectors for a couple of Ne values of your choosing, but staying between Ne = 1000 and Ne = 30000 Plot those AFS vectors and observe how (and why?) do they differ based on Ne parameter you used in each respective simulation.
Hint: The function should create a one-population forward-time model (our population starting at time = 1, with the model simulation_length 100000 and generation_time 1), simulate 10Mb tree sequence (recombination and mutation rates of 1e-8), compute AFS for 10 samples and return it the AFS vector as its result.
Hint:
If you’ve never programmed before, the concept of a “custom function” might be very alien to you. If you need help, feel free to start building your exercise3.R solution based on this “template” (just fill in missing relevant bits of slendr code that you should be already familiar with):
library(slendr)init_env()simulate_afs <-function(Ne) {# In here you should write code which will compile a single-population# demographic model based on the parameters specified above. It should# then simulate 10Mb of tree sequence using `msprime()`, compute an AFS# vector using the function `ts_afs()` for any subset of 10 individuals# in the tree sequence, and finally return that vector.return(result) # `result` is a variable with your 10-sample AFS, return it}afs_1 <-simulate_afs(Ne =1000) # simulate AFS from a Ne = 1000 model...plot(afs_1, type ="o") # ... and plot it
Note: Remember that you should drop the first element of the AFS vector produced by ts_afs() (for instance with something like result[-1] if result contains the output of ts_afs()) technical reasons related to tskit. You don’t have to worry about that here, but you can read this for more detail.
When used in R, your function should work like this:
# An R function can be understood as a block of a computer program which executes# a block of code inside the {...} brackets given a certain value of a parameter# (here 'Ne' just after the word 'function')simulate_afs <-function(Ne) {# create a slendr model with a single population of size Ne = N pop <-population("pop", N = Ne, time =1) model <-compile_model(pop, generation_time =1, simulation_length =100000)# simulate a tree sequence ts <-msprime(model, sequence_length =10e6, recombination_rate =1e-8) %>%ts_mutate(mutation_rate =1e-8)# get a random sample of names of 10 individuals samples <-ts_names(ts) %>%sample(10)# compute the AFS vector (dropping the 0-th element added by tskit) afs <-ts_afs(ts, sample_sets =list(samples))[-1] afs}# Let's use our custom function to simulate AFS vector for Ne = 1k, 10k, and 30kafs_1k <-simulate_afs(1000)afs_10k <-simulate_afs(10000)afs_30k <-simulate_afs(30000)# Plot the three simulated AFS using base R plotting functionalityplot(afs_30k, type ="o", main ="AFS, Ne = 30000", col ="cyan",)lines(afs_10k, type ="o", main ="AFS, Ne = 10000", col ="purple")lines(afs_1k, type ="o", main ="AFS, Ne = 1000", col ="blue")legend("topright", legend =c("Ne = 1k", "Ne = 10k", "Ne = 30k"),fill =c("blue", "purple", "cyan"))
Part 2: Estimating unknown \(N_e\) from empirical AFS
Imagine you sequenced 10 samples from a population and computed the following AFS vector (which contains, sequentially, the number of singletons, doubletons, etc.):
You know (maybe from some fossil evidence) that the population probably had a constant \(N_e\) somewhere between 1000 and 30000 for the past 100,000 generations, and had mutation and recombination rates of 1e-8 (i.e., parameters already implemented by your simulate_afs() function).
Use slendr simulations to guess the true (and hidden!) \(N_e\) given the observed AFS by running simulations for a range of \(N_e\) values and comparing each run to the afs_observed vector above.
Hints:
Depending on your level of comfort with R (and programming in general), you can choose one of the following approaches:
Option 1 [easy]: Plot AFS vectors for various \(N_e\) values, then eyeball which looks closest to the observed AFS based on the figures alone.
Option 2 [hard]: Simulate AFS vectors in steps of possible Ne (maybe lapply()?), and find the \(N_e\) which gives the closest AFS to the observed AFS based on Mean squared error.
Click to see the solution to “Option 1”
# This is our starting observed AFS which we want to compare simulated AFS vectors toafs_observed <-c(2520, 1449, 855, 622, 530, 446, 365, 334, 349, 244,264, 218, 133, 173, 159, 142, 167, 129, 125, 143)# We know that the Ne is between 1000 and 30000, so let's simulate# a bunch of AFS vectors for different Ne valuesafs_Ne1k <-simulate_afs(Ne =1000)afs_Ne5k <-simulate_afs(Ne =5000)afs_Ne6k <-simulate_afs(Ne =6000)afs_Ne10k <-simulate_afs(Ne =10000)afs_Ne20k <-simulate_afs(Ne =20000)afs_Ne30k <-simulate_afs(Ne =30000)# Plot all simulated AFS vectors, highlighting the observed AFS in blackplot(afs_observed, type ="b", col ="black", lwd =3,xlab ="allele count bin", ylab ="count", ylim =c(0, 13000))lines(afs_Ne1k, lwd =2, col ="blue")lines(afs_Ne5k, lwd =2, col ="green")lines(afs_Ne6k, lwd =2, col ="pink")lines(afs_Ne10k, lwd =2, col ="purple")lines(afs_Ne20k, lwd =2, col ="orange")lines(afs_Ne30k, lwd =2, col ="cyan")legend("topright",legend =c("observed AFS", "Ne = 1000", "Ne = 5000","Ne = 6000", "Ne = 10000", "Ne = 20000", "Ne = 30000"),fill =c("black", "blue", "green", "pink", "purple", "orange", "cyan"))
# !!!!! SPOILER ALERT BEFORE REVEALING THE CORRECT ANSWER !!!!!################################################ true Ne was 6543!
Click to see the solution to “Option 1”
# This is our starting observed AFS which we want to compare simulated AFS vectors toafs_observed <-c(2520, 1449, 855, 622, 530, 446, 365, 334, 349, 244,264, 218, 133, 173, 159, 142, 167, 129, 125, 143)# Generate regularly spaced values of potential Ne values (our parameter grid)Ne_grid <-seq(from =1000, to =30000, by =500)Ne_grid
library(parallel)# Compute AFS (in parallel, to make things faster) across the entire grid of possible Ne valuesafs_grid <-mclapply(Ne_grid, simulate_afs, mc.cores =detectCores())names(afs_grid) <- Ne_grid# Show the first five simulated AFS vectors, for brevityafs_grid[1:5]
# Plot the observed AFS...plot(afs_observed, type ="b", col ="black", lwd =3, xlab ="allele count bin", ylab ="count")# ... and overlay the simulated AFS vectors on top of itfor (i inseq_along(Ne_grid)) {lines(afs_grid[[i]], lwd =0.5)}legend("topright", legend =c("observed AFS", "simulated AFS"), fill =c("black", "gray"))
# Compute mean-squared error of the AFS produced by each Ne value across the griderrors <-sapply(afs_grid, function(sim_afs) {sum((sim_afs - afs_observed)^2) /length(sim_afs)})plot(Ne_grid, errors, ylab ="error")abline(v = Ne_grid[which.min(errors)], col ="red")legend("topright", legend =paste("minimum error Ne =", Ne_grid[which.min(errors)]), fill ="red")
# Plot the AFS again, but this time highlight the most likely spectrum# (i.e. the one which gave the lowest RMSE value)plot(afs_observed, type ="b", col ="black", lwd =3, xlab ="allele count bin", ylab ="count")for (i inseq_along(Ne_grid)) { color <-if (i ==which.min(errors)) "red"else"gray" width <-if (i ==which.min(errors)) 2else0.75lines(afs_grid[[i]], lwd = width, col = color)}legend("topright", legend =c("observed AFS", paste("best fitting Ne =", Ne_grid[which.min(errors)])),fill =c("black", "red"))
# !!!!! SPOILER ALERT BEFORE REVEALING THE CORRECT ANSWER !!!!!################################################ true Ne was 6543!
Exercise 4: Simulating dynamics of positive selection
My primary motivation for designing slendr was to make demographic modelling in R as trivially easy and fast as possible, focusing exclusively on neutral models. However, after slendr became popular, people have been asking me about the possibility of simulating natural selection. After all, a large part of slendr’s functionality deals with population genetic models across geographical landscapes, which requires SLiM. Even though these models are neutral, SLiM is the most famous and popular tool for simulating various aspects of natural selection.
Recently I caved in and added support for modifying slendr demographic models with bits of SLiM code which allows simulating pretty much any arbitrary selection scenario you might be interested in. This exercise is a quick demonstration of how this work and how you might simulate selection using slendr. We will do this using another toy model of ancient human history, which we will use as a basis for simulating the frequency trajectory of an allele under positive selection, and implementing a simpleselection scan summary statistic using Tajima’s D.
To speed things up, create a new exercise4.R script and copy the following code as a starting point for this exercise:
library(slendr)init_env()## The interface to all required Python modules has been activated.# This line sources a script in which I provide a few useful helper functions# which you can use in this exercisesource(here::here("utils.R"))# African ancestral populationafr <-population("AFR", time =65000, N =5000)# First migrants out of Africaooa <-population("OOA", parent = afr, time =60000, N =5000, remove =27000)# Eastern hunter-gatherersehg <-population("EHG", parent = ooa, time =28000, N =5000, remove =6000)# European populationeur <-population("EUR", parent = ehg, time =25000, N =5000)# Anatolian farmersana <-population("ANA", time =28000, N =5000, parent = ooa, remove =4000)# Yamnaya steppe populationyam <-population("YAM", time =8000, N =5000, parent = ehg, remove =2500)# Define gene-flow eventsgf <-list(gene_flow(from = ana, to = yam, rate =0.75, start =7500, end =6000),gene_flow(from = ana, to = eur, rate =0.5, start =6000, end =5000),gene_flow(from = yam, to = eur, rate =0.6, start =4000, end =3500))# Compile all populations into a single slendr model objectmodel <-compile_model(populations =list(afr, ooa, ehg, eur, ana, yam),gene_flow = gf, generation_time =30)# Schedule the sampling from four European populations roughly before their# disappearance (or before the end of the simulation)schedule <-rbind(schedule_sampling(model, times =0, list(eur, 50)),schedule_sampling(model, times =6000, list(ehg, 50)),schedule_sampling(model, times =4000, list(ana, 50)),schedule_sampling(model, times =2500, list(yam, 50)))
When you visualize the model, you might recognize it as a very simplified model of demographic history of Europe over the past 50 thousand years or so. As you can see, we are recording 50 individuals from four populations – for EURopeans we sample 50 individuals at “present-day”, for the remaining populations we’re recording 50 individuals just before their disappearance. Also note that there’s quite a bit of gene-flow! This was an important thing we’ve learned about human history in the past 10 years or so – everyone is mixed with pretty much everyone, there isn’t (and never was) anything as a “pure population”.
Part 1: Simulating a tree sequence and computing Tajima’s D
Although the point of this exercise is to simulate selection, let’s first simulate a normal neutral model using slendr’s msprime() engine as a sanity check. Simulate 10 Mb of sequence with a recombination rate 1e-8 and a sampling schedule defined above.
Inspect the table of all individuals recorded in our tree sequence using the function ts_samples(), making use we have all the individuals scheduled for tree-sequence recording.
# A tibble: 4 × 3
# Groups: pop [4]
pop time n
<chr> <dbl> <int>
1 ANA 4000 50
2 EHG 6000 50
3 EUR 0 50
4 YAM 2500 50
As you’ve learned, tskit functions in slendr generally operate on vectors (or lists) of individual names, like those produced by ts_names() above. Get a vector of names of individuals in every population recorded in the tree sequence, then use this to compute Tajima’s D using the slendr function ts_tajima().
Do you see any striking differences in the Tajima’s D values across populations? Check this for some general guidance.
# Compute genome-wide Tajima's D for each population -- note that we don't# expect to see any significant differences because no population experienced# natural selection (yet)ts_tajima(ts, sample_sets = samples, mode ="branch")
# A tibble: 4 × 2
set D
<chr> <dbl>
1 ANA -0.0960
2 EHG -0.0104
3 EUR -0.493
4 YAM -0.333
Part 2: Computing Tajima’s D in windows
Let’s take this one step forward. Even if there is a locus under positive selection somewhere along our chromosome, it might be quite unlikely that we would find a Tajima’s D value significant enough for the entire chromosome. Fortunately, thanks to the flexibility of the tskit module, the slendr function ts_tajima() has an argument windows =, which allows us to specify the coordinates of windows into which a sequence should be broken into, with Tajima’s D computed separately for each window. Perhaps this will allow us to see the impact of positive selection?
Define a variable windows which will contain a vector of coordinates of 100 windows, starting at position 0, and ending at position 10e6 (i.e. end of our chromosome). Then provide this variable as the windows = argument of ts_tajima(). Save the result of ts_tajima() into the variable tajima_wins, and inspect its contents in the R console.*
Hint: You can use the R function seq() and its argument length.out = 100, to create the coordinates of window boundaries very easily.
Click to see the solution
# Pre-compute genomic windows for window-based computation of Tajima's Dwindows <-round(seq(0, ts$sequence_length, length.out =100))windows
# Compute genome-wide Tajima's D for each population in individual windowstajima_wins <-ts_tajima(ts, sample_sets = samples, windows = windows, mode ="branch")tajima_wins
# A tibble: 4 × 2
set D
<chr> <named list>
1 ANA <dbl [99]>
2 EHG <dbl [99]>
3 EUR <dbl [99]>
4 YAM <dbl [99]>
# You can see that the format of the result is slightly strange, with the# `D` column containing a vector of numbers (this is done for conciseness)tajima_wins[1, ]$D
The default output format of ts_tajima() is not super user-friendly. Process the result using a helper function process_tajima(tajima_wins) that I provided for you (perhaps save it as tajima_df), and visualize it using another of my helper functions plot_tajima(tajima_df).
Click to see the solution
# The helper function `process_tajima()` reformats the results into a normal# data frame, this time with a new column `window` which indicates the index# of the window that each `D` value was computed intajima_df <-process_tajima(tajima_wins)tajima_df
# A tibble: 396 × 3
set D window
<chr> <dbl> <int>
1 ANA -0.0446 1
2 ANA 0.711 2
3 ANA 0.392 3
4 ANA -0.474 4
5 ANA 0.249 5
6 ANA -0.335 6
7 ANA 0.0962 7
8 ANA 0.312 8
9 ANA 0.387 9
10 ANA 0.546 10
# ℹ 386 more rows
# Now let's visualize the window-based Tajima's D along the simulated genome# using another helper function `plot_tajima()` (hint: we still don't expect# anything interesting here because we ran a purely neutral simulation, so this# is more like a control)plot_tajima(tajima_df)
Part 3: Adding positive selection to the base demographic model
Although primarily designed for neutral demographic models, slendr allows optional simulation of natural selection by providing a “SLiM extension code snippet” with customization SLiM code as an optional argument extension = of compile_model() (a function you’re closely familiar with at this point).
Unfortunately we don’t have any space to explain SLiM here (and I have no idea, at the time of writing, whether or not you will have worked with SLiM earlier in this workshop). Suffice to say that SLiM is another very popular population genetic simulator software which allows simulation of selection, and which requires you to write custom code in a different programming language called Eidos.
Take a look at the file slim_extension.txt provided in your working directory (it’s also part of the GitHub repository here). If you worked with SLiM before, glance through the script casually and see if it makes any sense to you. If you have not worked with SLiM before, look for the strange {elements} in curly brackets in the first ten lines of the script. Those are the parameters of the selection model we will be customizing are standard neutral demographic model from in the next step.
The SLiM extension script has three parameters:
origin_pop – in which population should a beneficial allele appear,
s – what should be the selection coefficient of the beneficial allele, and
onset_time – at which time should the allele appear in the origin_pop.
However, at the start, the SLiM extension snippet doesn’t contain any concrete values of those parameters, but only their {origin_pop}, {s}, and {onset_time} placeholders.
Use the slendr function substitute_values() to substitute concrete values for those parameters like this:
You can see that substitute_values() returned a path to a file. Take a look at that file in your terminal – you should see each of the three {placeholder} parameters replaced with a concrete given value.
Click to see the solution
Let’s take a look at the first 15 lines of the extension file before and after calling substitute_values(). We’ll do this in R for simplicity, but you can use less in plain unix terminal.
# Before -- see the {{placeholder}} parameters in their original formcat(paste(readLines("slim_extension.txt")[1:15], collapse ="\n"))
// Define model constants (to be substituted) all in one place
// (each {{placeholder}} will be replaced by a value passed from R).
// Note that string constant template patterns are surrounded by "quotes"!
initialize() {
defineConstant("s", {{s}});
defineConstant("onset_time", {{onset_time}});
defineConstant("origin_pop", "{{origin_pop}}");
// compose a trajectory file based on given parameters
defineConstant("traj_file", PATH + "/" + "trajectory.tsv");
}
// Because we want to simulate non-neutral evolution, we have to provide a
// custom initialization callback -- slendr will use it to replace its default
// neutral genomic architecture (i.e. the initialize() {...} callback it uses
# After -- see the {{placeholder}} parameters with concrete values!cat(paste(readLines(extension)[1:15], collapse ="\n"))
// Define model constants (to be substituted) all in one place
// (each {{placeholder}} will be replaced by a value passed from R).
// Note that string constant template patterns are surrounded by "quotes"!
initialize() {
defineConstant("s", 0.15);
defineConstant("onset_time", 12000);
defineConstant("origin_pop", "EUR");
// compose a trajectory file based on given parameters
defineConstant("traj_file", PATH + "/" + "trajectory.tsv");
}
// Because we want to simulate non-neutral evolution, we have to provide a
// custom initialization callback -- slendr will use it to replace its default
// neutral genomic architecture (i.e. the initialize() {...} callback it uses
How do we use the SLiM extension for our simulation? It’s very simple – we just have to provide the extension variable as an additional argument of good old compile_model(). This will compile a new slendr model which will now include the new functionality for simulating natural selection:
Compile a new model of the history of populations afr, ooa, ehg, etc., by following the instructions above.
Click to see the solution
model <-compile_model(populations =list(afr, ooa, ehg, eur, ana, yam),gene_flow = gf, generation_time =30,extension = extension # <======== this is missing in the neutral example!)
Part 4: Running a selection simulation using slim()
Now we can finally run our selection simulation!
There are two modifications to our previous simulation workflows:
Because we need to run a non-neutral simulation, we have to switch from using the msprime()slendr engine to slim(). The latter can still interpret the same demographic model we programmed in R, just like the msprime() engine can, but will run the model using SLiM (and thus leveraging the new SLiM extension code that we have customized using substituta_values() above). We simply do this by switching from this:
The customized model will not only produce a tree sequence, but will also generate a table of allele frequencies in each population (SLiM experts might have noticed the revelant SLiM code when they were inspecting slim_extension.txt). We need to be able to load both of these files after the simulation and thus need a path to a location we can find those files. We can do this by specifying calling the slim() function as path <- slim(..., path = TRUE) (so with the extra path = argument). This will return a path to where the slim() engine saved all output files.
Run a simulation from the modified model of selection with the slim() engine as instructed in 1. and 2. above, then use the list.files(path) function in R to take a look in the directory. Which files were produced by the simulation?
Click to see the solution (you have a working SLiM installation)
# tstart <- Sys.time()path <-slim(model, sequence_length =10e6, recombination_rate =1e-8, samples = schedule, path =TRUE, random_seed =59879916)# tend <- Sys.time()# tend - tstart # Time difference of 38.82014 secs# We can verify that the path not only contains a tree-sequence file but also# the table of allele frequencies.list.files(path)
[1] "slim.trees" "trajectory.tsv"
Click to see the solution (you don’t have a working SLiM installation)
# If you don't have SLiM set up, just use the simulated results from my own# run of the same simulationpath <- here::here("data/selection")# We can verify that the path not only contains a tree-sequence file but also# the table of allele frequencies.list.files(path)
[1] "slim.trees" "trajectory.tsv"
Part 5: Investigating allele frequency trajectories
Use another helper function read_trajectory(path) which I provided for this exercise to read the simulated frequency trajectories of the positively selected mutation in all of our populations into a variable traj_df. Then run a second helper function plot_trajectory(traj_df) to inspect the trajectories visually. The allele was programmed to occur in EURopeans 15 thousand years ago, and was configured to have been under very strong selection of \(s = 0.15\). Do the trajectories make sense given the demographic model of European prehistory plotted above?
Click to see the solution
traj_df <-read_trajectory(path)traj_df
# A tibble: 1,604 × 4
time pop freq onset
<dbl> <fct> <dbl> <dbl>
1 11990 EHG 0 12000
2 11990 ANA 0 12000
3 11990 EUR 0.0001 12000
4 11990 YAM NA 12000
5 11960 EHG 0 12000
6 11960 ANA 0 12000
7 11960 EUR 0.0001 12000
8 11960 YAM NA 12000
9 11930 EHG 0 12000
10 11930 ANA 0 12000
# ℹ 1,594 more rows
plot_trajectory(traj_df)
Warning: Removed 554 rows containing missing values or values outside the scale range
(`geom_line()`).
# Comparing the trajectories side-by-side with the demographic model reveals# some obvious patterns of both selection and demographic history.plot_grid(plot_model(model),plot_trajectory(traj_df),nrow =1, rel_widths =c(0.7, 1))
Warning: Removed 554 rows containing missing values or values outside the scale range
(`geom_line()`).
Part 6: Tajima’s D (genome-wide and window-based) from the selection model
Our simulation run saved results in the location stored in the path variable.
list.files(path)
[1] "slim.trees" "trajectory.tsv"
From this path, we’ve already successfuly investigated the frequency trajectories.
Now let’s compute Tajima’s D on the tree sequence simulated from our selection model. Hopefully we should see an interesting pattern in our selection scan? For instance, we don’t know yet where in the genome is the putative locus under selection!
To read a tree sequence simulated with slim() in the customized setup, we need to do a bit of work. To simplify things a bit, here’s the R code which makes it possible to do. Just copy it in your exercise4.R script as it is:
ts <-file.path(path, "slim.trees") %>%# 1. compose full path to the slim.trees filets_read(model) %>%# 2. read the tree sequence file into Rts_recapitate(Ne =5000, recombination_rate =1e-8) # 3. perform recapitation
Very briefly, because our tree sequence was generated by SLiM, it’s very likely that not all genealogies along the simulated genome will be fully coalesced (i.e., not all tree will have a single root). To explain why this is the case is out of the scope of this session, but read here if you’re interested in learning more. For the time being, it suffices to say that we can plug the (uncoalesced) tree sequence into the ts_recapitate() function, which then takes a SLiM tree sequence and simulates all necessary “ancestral history” that was missing on the uncoalesced trees, thus ensuring that the entire tree sequence is fully coalesced and can be correctly computed on.
Now that you have a ts tree sequence object resulting from a new selection simulation run, repeat the analyses of genome-wide and window-based Tajima’s D from Part 1 and Part 2 of this exercise, again using the provided helper functions process_tajima() and plot_tajima(). Can you identify which locus has been the likely focal point of the positive selection? Which population shows evidence of selection? Which doesn’t and why (look again at the visualization of the demographic model above)?
# Overall Tajima's D across the 10Mb sequence still doesn't reveal any significant# deviations even in case of selection (again, not entirely unsurprising)ts_tajima(ts, sample_sets = samples, mode ="branch")
# A tibble: 4 × 2
set D
<chr> <dbl>
1 ANA 0.0267
2 EHG 0.0266
3 EUR -0.318
4 YAM -0.340
# So let's look at the window-based computation again...windows <-as.integer(seq(0, ts$sequence_length, length.out =100))# compute genome-wide Tajima's D for each population in individual windowstajima_wins <-ts_tajima(ts, sample_sets = samples, windows = windows, mode ="branch")tajima_df <-process_tajima(tajima_wins)plot_tajima(tajima_df)
You should see a clear dip in Tajima’s D around the midpoint of the DNA sequence, but only in Europeans. The beneficial allele appeared in the European population, and although the plot of the allele frequency trajectories shows that the selection dynamics has been dramatically affected by gene-flow events (generally causing a repeated “dilution” of the selection signal in Europeans), there has never been gene-flow (at least in our model) from Europeans to other populations, so the beneficial allele never had a chance to “make it” into those populations.
Bonus exercises
Bonus 1: Program your own model
Vary the uniform recombination rate and observe what happens with Tajima’s D in windows along the genome.
Click to see the solution
Solution: just modify the value of the recombination_rate = argument provided to the slim() function above.
Bonus 2: Simulate origin of the allele in EHG
Simulate the origin of the beneficial allele in the EHG population – what do the trajectories look like now? How does that change the Tajima’s D distribution along the genome in our European populations?
Click to see the solution
Use this extension in the slim() call, and repeat the rest of the selection-based workflow in this exercise.